Custom OpenAI Chatbot Pt2: Fun with Lang Chain

How-to
Python
AI
The final steps for the creation of a custom ChatGPT chatbot
Author

Mark Edney

Published

November 29, 2023

Photo by Google DeepMind

Introduction

This is a continuation from a previous post about creating a custom ChatGPT bot with LangChain. In the previous post, we created CSV files containing the text captured from a folder full of different PDF files. We will continue from there, creating a custom ChatGPT bot from those CSV files.

Loading packages

As with any project, the first step is to load the necessary Python packages. The load_dotenv function from the dotenv package plays a special role in this program. In order to use the OpenAI API, we need to create an OpenAI account. With this account, you will receive an API key, which will be attached to any API calls made from your account. You can keep this key directly in your code, but it is better practice to store it in your environment variables. Environment variables are a location in Windows where you can keep sensitive information so that it is only stored on your computer. You can find your environment variables with the Windows search, or under your advanced system properties. Make a new entry in either the user variables or the system variables (whichever you prefer) and name it “OPENAI_API_KEY”. The value will be the actual key you get from your OpenAI account. After you save your API key, you may need to restart your terminal, or Windows itself, for the change to take effect.

You will need to pay for the API requests made from your OpenAI account. Using the embedding model turns out to be pretty cheap, while questioning ChatGPT costs a bit more, usually only one or two cents per question.

Code
import os
import sys
import pandas as pd
from dotenv import load_dotenv

load_dotenv()
False
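
Here, False just means that load_dotenv did not find a .env file in the project folder; the key is read from the Windows environment variables instead. As a quick, hypothetical sanity check, you can confirm the key is visible to Python before making any API calls:

Code
# Confirm the API key was picked up from the environment (no API call is made here)
key = os.getenv("OPENAI_API_KEY")
print(key is not None)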

Document Loading

The next step is to load the CSV files into the document format used by LangChain. I found that the best way to do this is with the DirectoryLoader and CSVLoader functions. I have also found that it is important to specify the encoding that was used to create the CSV files from the pandas dataframes. Furthermore, I've included an additional step which adds 1 to the row numbers in the metadata. The row value represents the page number within the document, but the default indexing for a pandas dataframe starts at 0 rather than 1. The length of the resulting list of documents is the overall number of pages that we have scanned from our PDF files.

Code
from langchain.document_loaders import DirectoryLoader, CSVLoader

# Load every CSV file in the folder, using the same encoding the files were written with
loader = DirectoryLoader(path = 'C:/Users/Mark/Desktop/csv_files',
                         glob = '**/*.csv',
                         loader_cls = CSVLoader,
                         use_multithreading = True,
                         show_progress = True,
                         silent_errors = True,
                         loader_kwargs = {'encoding': 'utf-8-sig'})

docs = loader.load()

# Shift the row numbers so they match the 1-based page numbers of the PDFs
for doc in docs:
  doc.metadata['row'] += 1
print(len(docs))
  0%|          | 0/3 [00:00<?, ?it/s]100%|██████████| 3/3 [00:00<?, ?it/s]
46

Just to get an idea of the data format, we can print the item at index 10 of our documents.

Code
docs[10]
Document(page_content=": 7\n0: Chatbot - Wikipedia https://en.wikipedia.org/wiki/Chatbot\n\nThe creation and implementation of chatbots is still a developing area, heavily related to artificial intelligence and machine learning,\nso the provided solutions, while possessing obvious advantages, have some important limitations in terms of functionalities and use\ncases. However, this is changing over time.\n\nThe most common limitations are listed below:!83]\n\n= As the input/output database is fixed and limited, chatbots can fail while dealing with an unsaved query.!52]\n\n= A chatbot's efficiency highly depends on language processing and is limited because of irregularities, such as accents and\nmistakes.\n\n= Chatbots are unable to deal with multiple questions at the same time and so conversation opportunities are limited.|83I\n\n= Chatbots require a large amount of conversational data to train. Generative models, which are based on deep learning\nalgorithms to generate new responses word by word based on user input, are usually trained on a large dataset of natural-\nlanguage phrases. |S]\n\n« Chatbots have difficulty managing non-linear conversations that must go back and forth on a topic with a user.|&41\n\n= As it happens usually with technology-led changes in existing services, some consumers, more often than not from older\ngenerations, are uncomfortable with chatbots due to their limited understanding, making it obvious that their requests are being\ndealt with by machines. |83]\n\nChatbots and jobs\n\nChatbots are increasingly present in businesses and often are used to automate tasks that do not require skill-based talents. With\ncustomer service taking place via messaging apps as well as phone calls, there are growing numbers of use-cases where chatbot\ndeployment gives organizations a clear return on investment. Call center workers may be particularly at risk from AI-driven\nchatbots.!85]\n\nChatbot jobs\n\nChatbot developers create, debug, and maintain applications that automate customer services or other communication processes.\nTheir duties include reviewing and simplifying code when needed. They may also help companies implement bots in their operations.\n\nA study by Forrester (June 2017) predicted that 25% of all jobs would be impacted by AI technologies by 2019.!86]\n\nSee also\n\n= Applications of artificial intelligence =» Autonomous agent\n\n8 of 18 2023-11-10, 9:32 a.m.", metadata={'source': 'C:\\Users\\Mark\\Desktop\\csv_files\\Chatbot - Wikipedia.csv', 'row': 8})

Document Transformation

We could work with the documents just as they are now, but it is sometimes difficult to get the proper level of context for a question from an entire page of information. One way to rectify this is with the RecursiveCharacterTextSplitter function, which breaks the data down into smaller sections. The function has some built-in breakpoints, like section titles and ends of paragraphs. We can then use the smaller, split-up documents to make sure our searches are more accurate. I have found that using both the full page documents for overall context and the smaller in-depth documents for finer details works best. We see that this drastically increases the number of documents that we need to search through.

Code
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Break each page into roughly 512-character chunks with 100 characters of overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 512, chunk_overlap = 100)

documents = text_splitter.split_documents(docs)
# Keep the full pages as well, so both page-level context and fine-grained detail can be retrieved
documents = documents + docs
len(documents)
454
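
To see the effect of the splitter, you can compare the length of one of the split chunks against the full page it came from. This is just an illustrative check, assuming the two lists built above:

Code
# A split chunk is capped at roughly chunk_size characters, while a full page can be much longer
print(len(documents[0].page_content))
print(len(docs[0].page_content))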

Text Embeddings

Text embeddings are the real power behind all natural language processing, including chatbots. A text embedding is a vector that represents the meaning behind a text rather than the text itself. In this way, we can search through the meaning of our documents for an answer without needing to match the text exactly. In order to use the OpenAI API to answer our questions, we need to use the OpenAI embeddings model. Its use is pretty simple: we send it a document and it returns the embedding.

Code
import openai
from langchain.embeddings import OpenAIEmbeddings

# Read the API key from the environment and allow a generous timeout for embedding requests
openai.api_key = os.getenv("OPENAI_API_KEY")
embeddings_model = OpenAIEmbeddings(request_timeout = 60)
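
If you are curious what an embedding actually looks like, you can embed a short string directly. Note that this sketch makes a (cheap) API call, and the 1536-dimension length assumes OpenAI's default ada-002 embedding model:

Code
# Embed a short, hypothetical query and inspect the resulting vector
vec = embeddings_model.embed_query("What do chatbots do?")
print(len(vec))   # 1536 dimensions with the default ada-002 model
print(vec[:5])    # the first few floating point values of the embedding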

Vector Store

The next step is not 100% necessary, but it will make our lives much easier. What we need to do is create a database that connects each document with its embedding. We can do this by creating a vector store. There are a few different vector stores; the one I chose to use is FAISS. All we need to do is define the vector store and pass it our documents along with a connection to our embedding model. The function will then process each document for us.

Code
from langchain.vectorstores import FAISS

db = FAISS.from_documents(documents, embeddings_model)
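
Since every call to the embedding model costs a little money, it can be worth saving the index to disk so you do not have to re-embed the documents on every run. A minimal sketch, assuming a local folder name of your choosing (depending on your LangChain version, load_local may also require an allow_dangerous_deserialization flag):

Code
# Persist the FAISS index locally, then reload it later without re-embedding
db.save_local('faiss_index')
db = FAISS.load_local('faiss_index', embeddings_model)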

MultiQuery Retriever

What we have now is a database that we can search ourselves if we like. You can query the vector store: the query is converted into an embedding, and the vector store is searched for matches. The return from this query is a list of documents from the vector store, which may be useful, but can be difficult to interpret by itself. What a custom chatbot does is take these basic search results and supply them to a Large Language Model (LLM) to make sense of them. We would like our chatbot to have the entire vector store to look through to answer our questions, but there are limits to how much context we can provide to OpenAI.
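
To make that concrete, here is what a direct similarity search against the vector store looks like, with a hypothetical query; k controls how many documents are returned:

Code
# Search the vector store directly; each result is a Document with content and metadata
results = db.similarity_search("What are the limitations of chatbots?", k = 3)
for res in results:
    print(res.page_content[:100], '...')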

For reference, the ChatGPT (GPT-3) API has a limit of about 4,000 tokens. What are tokens? Well, tokens are small sections of text. They are usually longer than one character but shorter than most words. One thing to keep in mind is that the 4,000-token limit covers everything, including the context provided, the query, and the answer returned. Due to this limitation, we need to search through our vector store and limit the context that we send to the OpenAI API.
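
If you want to see how many tokens a piece of text uses, OpenAI's tiktoken package can count them with the same encoding the models use. A small sketch, assuming the gpt-3.5-turbo encoding:

Code
import tiktoken

# Count tokens the same way the OpenAI models do
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
tokens = enc.encode("Chatbots mimic human conversation through text or voice interactions.")
print(len(tokens))   # usually a bit higher than the number of words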

We could perform a basic query, but I prefer to create a multi-query. With a multi-query, we actually ask ChatGPT to create different variations of our query. We can then search our vector store with multiple queries and return the best results across all of them. There is some expense to this, as it requires an additional API call to generate the new queries.

We can then send our question to our retriever, which will automatically perform the search over the vector store. I use the default prompt, as it seems to do a good job. The custom chatbot will then return an answer to our question, based on the context provided by the vector store search, along with the source documents it used to generate the answer.

Code
from langchain.chains import RetrievalQA
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain.llms import OpenAI

# Build a multi-query retriever on top of the vector store (note search_type and search_kwargs)
retriever_from_llm = MultiQueryRetriever.from_llm(
  retriever = db.as_retriever(search_type = 'mmr', search_kwargs = {'k': 10}),
  llm = OpenAI())

question = "What do Chat bots do?"
qa = RetrievalQA.from_chain_type(llm = OpenAI(),
                                 chain_type = "stuff",
                                 retriever = retriever_from_llm,
                                 return_source_documents = True)

ans = qa({"query": question}, return_only_outputs = True)
print(ans)
{'result': ' Chatbots are software applications or web interfaces that aim to mimic human conversation through text or voice interactions. Modern chatbots are typically online and use artificial intelligence (AI) systems that are capable of maintaining a conversation with a user in natural language and simulating the way a human would behave as a conversational partner. They are used for B2C customer service, sales and marketing, as well as in toys, devices, and other applications. They can also be used to fill chat rooms with spam and advertisements, by mimicking human behavior and conversations or to entice people into revealing personal information.', 'source_documents': [Document(page_content='A chatbot (originally chatterbot!!) is a software application or web interface that aims to\nmimic human conversation through text or voice interactions.l2I[3Il41 Modern chatbots are\ntypically online and use artificial intelligence (AI) systems that are capable of maintaining a\nconversation with a user in natural language and simulating the way a human would behave as a\nconversational partner. Such technologies often utilize aspects of deep learning and natural', metadata={'source': 'C:\\Users\\Mark\\Desktop\\csv_files\\Chatbot - Wikipedia.csv', 'row': 1}), Document(page_content=': 3\n0: Chatbot - Wikipedia https://en.wikipedia.org/wiki/Chatbot\n\nIn 2016, Facebook Messenger allowed developers to place chatbots on their platform. There were 30,000 bots created for Messenger\nin the first six months, rising to 100,000 by September 2017.!27]\n\nSince September 2017, this has also been as part of a pilot program on WhatsApp. Airlines KLM and Aeroméxico both announced\ntheir participation in the testing;!281[29l[30I[31] oth airlines had previously launched customer services on the Facebook Messenger\nplatform.\n\nThe bots usually appear as one of the user\'s contacts, but can sometimes act as participants in a group chat.\n\nMany banks, insurers, media companies, e-commerce companies, airlines, hotel chains, retailers, health care providers, government\nentities and restaurant chains have used chatbots to answer simple questions, increase customer engagement,!32! for promotion, and\nto offer additional ways to order from them.!33]\n\nA 2017 study showed 4% of companies used chatbots.!34] According to a 2016 study, 80% of businesses said they intended to have\none by 2020,[35]\n\nAs part of company apps and websites\n\nPrevious generations of chatbots were present on company websites, e.g. Ask Jenn from Alaska Airlines which debuted in 200834 or\nExpedia\'s virtual customer service agent which launched in 2011.[361[37] The newer generation of chatbots includes IBM Watson-\npowered "Rocky", introduced in February 2017 by the New York City-based e-commerce company Rare Carat to provide information\nto prospective diamond buyers.!381[39]\n\nChatbot sequences\n\nUsed by marketers to script sequences of messages, very similar to an autoresponder sequence. Such sequences can be triggered by\nuser opt-in or the use of keywords within user interactions. After a trigger occurs a sequence of messages is delivered until the next\nanticipated user response. Each user response is used in the decision tree to help the chatbot navigate the response sequences to\ndeliver the correct response message.\n\nCompany internal platforms\n\nOther companies explore ways they can use chatbots internally, for example for Customer Support, Human Resources, or even in\nInternet-of-Things (IoT) projects. 
Overstock.com, for one, has reportedly launched a chatbot named Mila to automate certain simple\nyet time-consuming processes when requesting sick leave.[4°] Other large companies such as Lloyds Banking Group, Royal Bank of\n\n4 of 18 2023-11-10, 9:32 a.m.', metadata={'source': 'C:\\Users\\Mark\\Desktop\\csv_files\\Chatbot - Wikipedia.csv', 'row': 4}), Document(page_content='of chatbots). Chatbots can also be designed or customized to further target even more specific\nsituations and/or particular subject-matter domains.!Z]', metadata={'source': 'C:\\Users\\Mark\\Desktop\\csv_files\\Chatbot - Wikipedia.csv', 'row': 1}), Document(page_content=': 2\n0: Chatbot - Wikipedia https://en.wikipedia.org/wiki/Chatbot\n\n(though the program as released would not have been capable of doing so).{15]\n\nFrom 197816] to some time after 1983,!7] the CYRUS project led by Janet Kolodner constructed a chatbot simulating Cyrus Vance\n(57th United States Secretary of State). It used case-based reasoning, and updated its database daily by parsing wire news from\nUnited Press International. The program was unable to process the news items subsequent to the surprise resignation of Cyrus Vance\nin April 1980, and the team constructed another chatbot simulating his successor, Edmund Muskie.8]57]\n\nOne pertinent field of AI research is natural-language processing. Usually, weak AI fields employ specialized software or\nprogramming languages created specifically for the narrow function required. For example, A.L.I.C.E. uses a markup language called\nAIML,3] which is specific to its function as a conversational agent, and has since been adopted by various other developers of, so-\ncalled, Alicebots. Nevertheless, A.L.I.C.E. is still purely based on pattern matching techniques without any reasoning capabilities, the\nsame technique ELIZA was using back in 1966. This is not strong AI, which would require sapience and logical reasoning abilities.\n\nJabberwacky learns new responses and context based on real-time user interactions, rather than being driven from a static database.\nSome more recent chatbots also combine real-time learning with evolutionary algorithms that optimize their ability to communicate\nbased on each conversation held. Still, there is currently no general purpose conversational artificial intelligence, and some software\ndevelopers focus on the practical aspect, information retrieval.\n\nChatbot competitions focus on the Turing test or more specific goals. Two such annual contests are the Loebner Prize and The\nChatterbox Challenge (the latter has been offline since 2015, however, materials can still be found from web archives)./291\n\nChatbots may use neural networks as a language model. For example, generative pre-trained transformers (GPT), which use the\ntransformer architecture, have become common to build sophisticated chatbots. The "pre-training" in its name refers to the initial\ntraining process on a large text corpus, which provides a solid foundation for the model to perform well on downstream tasks with\nlimited amounts of task-specific data. An example of a GPT chatbot is ChatGPT.[2°] Despite criticism of its accuracy, ChatGPT has\ngained attention for its detailed responses and historical knowledge. 
Another example is BioGPT, developed by Microsoft, which\nfocuses on answering biomedical questions.!21I[22]\n\nDBpedia created a chatbot during the GSoC of 2017./231[241I25] Tt can communicate through Facebook Messenger.\n\nApplication\nMessaging apps\nMany companies’ chatbots run on messaging apps or simply via SMS. They are used for B2C customer service, sales and\n\nmarketing. [26]\n\n3 of 18 2023-11-10, 9:32 a.m.', metadata={'source': 'C:\\Users\\Mark\\Desktop\\csv_files\\Chatbot - Wikipedia.csv', 'row': 3}), Document(page_content=': 2\n0: Chatbot - Wikipedia https://en.wikipedia.org/wiki/Chatbot\n\n(though the program as released would not have been capable of doing so).{15]', metadata={'source': 'C:\\Users\\Mark\\Desktop\\csv_files\\Chatbot - Wikipedia.csv', 'row': 3}), Document(page_content=": 6\n0: Chatbot - Wikipedia https://en.wikipedia.org/wiki/Chatbot\n\nIn India, the state government has launched a chatbot for its Aaple Sarkar platform,!Z] which provides conversational access to\ninformation regarding public services managed.!721I73]\n\nToys\nChatbots have also been incorporated into devices not primarily meant for computing, such as toys.[74]\n\nHello Barbie is an Internet-connected version of the doll that uses a chatbot provided by the company ToyTalk,!75] which previously\nused the chatbot for a range of smartphone-based characters for children.!76] These characters' behaviors are constrained by a set of\nrules that in effect emulate a particular character and produce a storyline.!77]\n\nThe My Friend Cayla doll was marketed as a line of 18-inch (46 cm) dolls which uses speech recognition technology in conjunction\nwith an Android or iOS mobile app to recognize the child's speech and have a conversation. It, like the Hello Barbie doll, attracted\ncontroversy due to vulnerabilities with the doll's Bluetooth stack and its use of data collected from the child's speech.\n\nIBM's Watson computer has been used as the basis for chatbot-based educational toys for companies such as CogniToys!74] intended\nto interact with children for educational purposes.!78]\n\nMalicious use\n\nMalicious chatbots are frequently used to fill chat rooms with spam and advertisements, by mimicking human behavior and\nconversations or to entice people into revealing personal information, such as bank account numbers. They were commonly found on\nYahoo! Messenger, Windows Live Messenger, AOL Instant Messenger and other instant messaging protocols. There has also been a\npublished report of a chatbot used in a fake personal ad on a dating service's website.[79]\n\nTay, an AI chatbot that learns from previous interaction, caused major controversy due to it being targeted by internet trolls on\nTwitter. The bot was exploited, and after 16 hours began to send extremely offensive Tweets to users. This suggests that although the\nbot learned effectively from experience, adequate protection was not put in place to prevent misuse.[8°]\n\nIf a text-sending algorithm can pass itself off as a human instead of a chatbot, its message would be more credible. Therefore,\nhuman-seeming chatbots with well-crafted online identities could start scattering fake news that seems plausible, for instance\nmaking false claims during an election. 
With enough chatbots, it might be even possible to achieve artificial social proof.[81J[82]\n\nLimitations of chatbots\n\n7 of 18 2023-11-10, 9:32 a.m.", metadata={'source': 'C:\\Users\\Mark\\Desktop\\csv_files\\Chatbot - Wikipedia.csv', 'row': 7}), Document(page_content='Chatbots and jobs\n\nChatbots are increasingly present in businesses and often are used to automate tasks that do not require skill-based talents. With\ncustomer service taking place via messaging apps as well as phone calls, there are growing numbers of use-cases where chatbot\ndeployment gives organizations a clear return on investment. Call center workers may be particularly at risk from AI-driven\nchatbots.!85]\n\nChatbot jobs', metadata={'source': 'C:\\Users\\Mark\\Desktop\\csv_files\\Chatbot - Wikipedia.csv', 'row': 8}), Document(page_content='Many high-tech banking organizations are looking to integrate automated AI-based solutions such as chatbots into their customer\nservice in order to provide faster and cheaper assistance to their clients who are becoming increasingly comfortable with technology.\nIn particular, chatbots can efficiently conduct a dialogue, usually replacing other communication tools such as email, phone, or SMS.', metadata={'source': 'C:\\Users\\Mark\\Desktop\\csv_files\\Chatbot - Wikipedia.csv', 'row': 5}), Document(page_content='Chatbot jobs\n\nChatbot developers create, debug, and maintain applications that automate customer services or other communication processes.\nTheir duties include reviewing and simplifying code when needed. They may also help companies implement bots in their operations.\n\nA study by Forrester (June 2017) predicted that 25% of all jobs would be impacted by AI technologies by 2019.!86]\n\nSee also\n\n= Applications of artificial intelligence =» Autonomous agent\n\n8 of 18 2023-11-10, 9:32 a.m.', metadata={'source': 'C:\\Users\\Mark\\Desktop\\csv_files\\Chatbot - Wikipedia.csv', 'row': 8})]}
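
The raw dictionary above is a bit of a wall of text. If you only want the answer plus a short list of the pages it was drawn from, a sketch like this prints a cleaner summary (it assumes the ans variable from the previous block):

Code
# Print just the answer, followed by the source file and page of each supporting document
print(ans['result'])
for doc in ans['source_documents']:
    print(doc.metadata['source'], 'page', doc.metadata['row'])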

If you are wondering what the default prompt looks like, you can use the following code. You can create your own prompt if you like, but the default works well for me. With a custom prompt you can customize the output, such as having the results summarized in a Markdown table; a sketch of such a prompt follows the next code block.

Code
print(retriever_from_llm.llm_chain.prompt)
input_variables=['question'] template='You are an AI language model assistant. Your task is \n    to generate 3 different versions of the given user \n    question to retrieve relevant documents from a vector  database. \n    By generating multiple perspectives on the user question, \n    your goal is to help the user overcome some of the limitations \n    of distance-based similarity search. Provide these alternative \n    questions separated by newlines. Original question: {question}'
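
If you do want to customize the output, you can pass your own prompt to the question-answering chain. The sketch below is one hypothetical way to do it for the "stuff" chain type, asking for the answer as a Markdown table; the prompt must keep the context and question placeholders:

Code
from langchain.prompts import PromptTemplate

# A hypothetical custom prompt that asks for the answer formatted as a Markdown table
template = """Use the following context to answer the question at the end.
Return the answer as a Markdown table.

Context: {context}

Question: {question}"""

custom_qa = RetrievalQA.from_chain_type(llm = OpenAI(),
                                        chain_type = "stuff",
                                        retriever = retriever_from_llm,
                                        return_source_documents = True,
                                        chain_type_kwargs = {"prompt": PromptTemplate.from_template(template)})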

Conclusion

Well, those are all the steps needed to create your very own custom ChatGPT bot with LangChain. We loaded our CSV files, split the documents into smaller sections, created a vector store of our documents, searched through the vector store with a multi-query, and forwarded the context along with our question to the OpenAI API. The result is an answer that is limited to the data stored in our PDF documents.